Bayesian Covariate-Dependent Gaussian Graphical Models with Varying Structure

We introduce Bayesian Gaussian graphical models with covariates (GGMx), a class of multivariate Gaussian distributions with covariate-dependent sparse precision matrix. We propose a general construction of a functional mapping from the covariate space to the cone of sparse positive definite matrices, which encompasses many existing graphical models for heterogeneous settings. Our methodology is based on a novel mixture prior for precision matrices with a non-local component that admits attractive theoretical and empirical properties. The flexible formulation of GGMx allows both the strength and the sparsity pattern of the precision matrix (hence the graph structure) change with the covariates. Posterior inference is carried out with a carefully designed Markov chain Monte Carlo algorithm, which ensures the positive definiteness of sparse precision matrices at any given covariates’ values. Extensive simulations and a case study in cancer genomics demonstrate the utility of the proposed model.


Introduction
Undirected Gaussian graphical models (GGMs), also known as Gaussian Markov random fields, are one of the common tools to analyze multivariate data with complex structure and find many useful applications across biomedicine, finance, and public health.A GGM can simply be expressed as a multivariate Gaussian distribution with a sparse precision (inversecovariance) matrix.The zero entries of the precision matrix have probabilistic interpretation of conditional independence between the Gaussian random variables (nodes of a graph).
Moreover, all the conditional independence relationships can be directly estimated from the accompanying undirected graph for which a zero entry in the precision matrix corresponds to a missing edge in the graph.This equivalency, essentially reduces the problem of graph structure learning in GGMs to finding zeros in the precision matrix.
Many existing GGM approaches (Dobra et al., 2004;Sudderth et al., 2004;Meinshausen and Bühlmann, 2006;Yuan and Lin, 2007;Friedman et al., 2008;Scott and Carvalho, 2008;Dobra et al., 2011;Green and Thomas, 2013;Drton and Maathuis, 2017;Khare et al., 2018;Massam, 2018;Gan et al., 2019) assume an independent and identically distributed (i.i.d.) sampling scheme y i = (y i1 , … , y ip ) ~ N (0, Ω −1 ) for i = 1, … , n where Ω is the precision matrix.However, the independence assumption does not hold in many applications.For example, observations in multivariate time series data are not independent and exhibit temporal correlations; similarly for spatial data with spatial correlation.In addition, the assumption of identical distribution implies homogeneity across observations and is often violated as well.For instance, tumor heterogeneity is a well-known characteristic in cancer: patients with the same cancer-type can be rather different in their genetic/genomic architecture.Forcing the same GGM (i.e., the same precision matrix Ω) onto every patient is a restrictive modeling assertion, when modeling cancer genomic networks.
Attempts have been made to extend GGMs or other types of graphical models beyond i.i.d.data.If there is a natural grouping of the observations, multiple graphical models (Guo et al., 2011;Danaher et al., 2014;Oates et al., 2014;Peterson et al., 2015;Yajima et al., 2015;Xie et al., 2016;Ni et al., 2018;Shaddox et al., 2018) can be applied to learn group-specific graphs assuming observations within each group are i.i.d.. Another line of work incorporates additional covariates x i in estimating graphs.Conditional Gaussian graphical models (Rothman et al., 2010;Yin and Li, 2011;Bhadra and Mallick, 2013) are multivariate linear regression models with the error terms following an i.i.d.GGM (can be viewed as chain graphical models).While graph estimation is conditional on the covariates, they only enter the model via the mean structure.As a consequence, the graph topology and the precision matrix stay the same across observations.In this paper, we are taking a more direct approach, in the sense that the latent graph and hence the sparse precision matrix are explicit functions of covariates.
There are a few recent work in this direction.Liu et al. (2010a) proposed a tree-based method that partitions the covariate space into a finite number of subspaces by classification and regression trees and fits GGMs separately to subsets of data.However, the estimated graphs may be unstable and lack similarity for similar covariates due to the separate graph estimation, as reported by Cheng et al. (2014).Kolar et al. (2010) proposed a penalized kernel smoothing approach that allows the precision matrix to vary with covariates.Cheng et al. (2014) developed a conditional Ising model for binary data where the dependencies are linear functions of covariates.Although the methods of Kolar et al. (2010) and Cheng et al. (2014) allow edge strength to vary with covariates, graph structure is assumed to be constant across all observations.Recently, Ni et al. (2019) proposed a graphical regression framework that allows both edge strength and graph structure to vary with covariates in (directed) Bayesian networks.They assumed there exists a natural ordering of the nodes.Given this assumption, Bayesian networks can be written as systems of recursive linear regressions.A conditional independence function was then introduced to connect regression coefficients with covariates.
In this paper, we consider a general problem of estimating undirected GGMs conditional on covariates (GGMx).GGMx allows not only the edge strength (i.e., off-diagonal elements of precision matrix) but also the graph structure (i.e., sparsity pattern of precision matrix) to vary as functions of covariates, which is illustrated in Figure 1 with graphs with four nodes and two covariates.Figure 1 also illustrates the generative mechanism underlying GGMx: covariates x i generate sparse precision matrices Ω i (hence the graphs G i ), which in turn generate responses y i .The major challenge in this context is the positive definiteness constraint of precision matrices -a sine qua non for GGMs -in the presence of covariates.
We propose a simple strategy by specifying a matrix-valued function f(•), such that Ω i = f(x i ) is a positive definite matrix for any x i almost surely; along with the function f(•) consisting of a random thresholding component that encourages sparse precision matrix estimation, specifically enforcing the required zero-pattern that corresponds to missing edges.The sparse functional relationship between Ω i and x i allows for novel graph interpolation for an unseen observation at covariates x*.We show that the random thresholding gives rise to a discrete mixture of non-local priors (Johnson and Rossell, 2010) for precision matrices.
We also carefully design a Markov chain Monte Carlo (MCMC) algorithm for posterior inference, which guarantees to propose positive definite precision matrices for any x i .GGMx allows for subject-level inference on unknown graphs.Moreover, GGMx is a general class of graphical models, which subsumes at least five special cases including standard GGMs, group-specific GGMs (Guo et al., 2011;Danaher et al., 2014;Peterson et al., 2015), time-varying GGMs (Zhou et al., 2010), covariate-dependent GGMs (Kolar et al., 2010), and context-specific GGMs (Nyman et al., 2017).Extensive simulation studies show strong and robust performance of GGMx compared with competing methods.Using a cancer genomics case study, we demonstrate how GGMx can be used to infer subject-specific gene networks, which can facilitate deeper investigations in the genomic foundation of precision medicine.
The rest of this article is organized as follows.We introduce the background and notations in Section 2. We present the proposed GGMx in Section 3 and discuss the link between the random thresholding prior and non-local priors in Section 4. We summarize the posterior inference and graph interpolation in Section 5. We demonstrate the utility and robustness of GGMx with extensive simulation studies in Section 6. GGMx is illustrated by a real data application in Section 7. Section 8 provides our closing discussion.

Background and notation
A GGM is a multivariate Gaussian distribution with a sparse precision matrix.Let Y = Y 1 , …, Y p ∼ N 0, Ω −1 be multivariate Gaussian random variables with mean zero and precision matrix Ω = ω jk .Since the off-diagonal elements in Ω are proportional to partial correlations, a zero entry ω jk = 0 indicates that Y j and Y k are conditionally independent given all other variables.A GGM graphically represents the zero patterns of Ω by an undirected graph.An undirected graph G = (V, E) consists of a set of nodes V = {1, …, p} and a set of undirected edges E ⊆ j, k | j, k ∈ V .The nodes V represent the variables Y and an edge {j, k} is present in the graph if and only if ω jk ≠ 0. This is not an arbitrary way of drawing a graph.In fact, the conditional independence relationships that are encoded in the multivariate Gaussian distribution can be directly read off from G using the notion of graph separation.Importantly, learning the graph structure is equivalent to finding the zero patterns of Ω.
Under the Bayesian paradigm, several prior distributions (Roverato, 2002;Wang et al., 2012;Wang, 2015) for sparse precision matrices have been developed, which all take the same general form, with M + being the collection of positive definite matrices (PDMs).For example, G-Wishart prior (Roverato, 2002) assumes π( Ω ) to be a Wishart distribution W isℎart ⋅ | b, Ω 0 and M + ≔ M G + to be PDMs consistent with a graph G, Graphical spike-and-slab prior (Wang, 2015) replaces the doubleexponential priors in Bayesian graphical lasso by spike-and-slab priors, where v 1 ≫ v 0 .Priors on Ω can be defined either conditionally on the graph G or marginally; in what follows we do not use a model indicator parameter G but will infer the graph structures directly from the zero patterns in the precision matrices.

Gaussian graphical models with covariates
Let y 1 , … , y n be n realizations of a random vector Y = (Y 1 , … , Y p ).We assume an independent multivariate Gaussian distribution for each observation y i ∼ p y i | Ω i = N 0, Ω i −1 with the precision matrix Ω i = ω ijk , importantly, indexed by i = 1, … , n.A subject-level graph G i = (V, E i ) is embedded in the subject-level precision matrix Ω i : j, k ∈ E i if and only if ω ijk ≠ 0.
Without further modeling assumptions, Ω i cannot be estimated with a single observation i.
Let x 1 , … , x n be n realizations of covariates X = (1, X 1 , … , X q ).Note that when X = 1 (i.e., there is no covaraites), the proposed GGMx is reduced to standard GGMs; more discussion of special cases of GGMx will be given later.We model Ω i ≔ f(x i ) through a symmetric matrix-valued function f(•), which is estimable as a population-level parameter shared across all observations.
General construction of covariate-dependent priors.
The key is the construction of the function f( ⋅ ) = [f jk ( ⋅ )] such that Ω i = f x i is a PDM for any x i , i = 1, … , n.Let ℳ + denote the collection of all such functions.This can be achieved by specifying a prior f ~ II that assigns positive mass only on functions that satisfy such requirement, Π ℳ + = 1.We consider the following generalization of the prior density in (1) as, where π is a distribution on matrix-valued functions.Note that the support of π is not limited to ℳ + , offering great flexibility in the choice of π.For example, we can start from independent distributions a priori such that π(f) = ∏ j ≤ k π(f jk ); using this construction, the marginal distribution π(f jk ) need not be defined with a constrained range.Because of the deterministic relationship Ω i = f (x i ), prior π(f) induces a conditional prior on Ω i given x i .
Two additional critical properties are desired for f(•).(i) Smoothness -similar inputs should give rise to similar PDMs.Without smoothness, similar subjects may have vastly different networks, which is difficult to interpret in many applications including ours.(ii) Sparsityπ(f) should have positive probability on sparse PDMs.Sparsity is a common assumption in high-dimensional models including GGMs, which improves statistical efficiency and interpretability compared to dense models.In order to encourage sparsity of Ωi, a positive mass has to be placed on sparse PDMs a priori because otherwise there will be zero mass on sparse PDMs a posteriori even if data strongly favor sparse PDMs.To equip f(•) with these two properties, we decompose each off-diagonal element f jk (•) of f (•) into two components, where g jk ( ⋅ ) is some smooth function, the hard thresholding I(|g jk x i | > t jk ) promotes sparsity in f jk ( ⋅ ), and t jk is a random threshold, which can be interpreted as a minimum effect size of ω ijk .Specifically, whenever g jk x i is less than t jk in magnitude, the hard thresholding truncates f jk x i to zero and hence induces a missing edge between nodes j and k for subject i.Our use of a thresholding function to induce sparsity on precision matrices Ωi= f(x i ) is novel and crucially different from conventional GGM priors including the G-Wishart prior and the graphical spike-and-slab prior (Wang, 2015): in order to construct observationspecific graphs, conventional priors would require a latent indicator for each potential edge and each observation, which would greatly increase the model complexity.For example, in our application with multiple myeloma dataset, conventional priors would need n•p•(p-1)/2 = 79,728 latent indicators whereas the proposed GGMx needs much fewer p • (p -1)/2 = 528 thresholding parameters.Moreover, as will be introduced later, GGMx enables undirected graph interpolation for unseen covariates, a new feature that is difficult to obtain with conventional priors.Other choices of thresholding functions are possible such as soft thresholding and nonnegative garrote thresholding.The main motivation of choosing hard thresholding over the alternatives is its theoretical connection with mixture of non-local priors; see Section 4.
For the diagonal elements (inverse-partial-variance, Whittaker 2009) f jj (•) of f(•), we assume the following model to ensure its nonnegativity, f jj x i = exp{g jj x i } . (4) Note that unlike off-diagonal elements in (3), the diagonal element f jj (•) is not subject to thresholding.
Remark 1 Our formulation encompasses covariate-dependent priors on both the off-diagonal (inverse-covariance) and diagonal (inverse-partial-variance) elements, thus conducting both graphical and inverse-partial-variance regression, simultaneously.

Remark 2
The proposed prior has two advantages over the more commonly used G-Wishart prior: (i) the induced prior on Ω i from (2) explicitly incorporates covariates x i and (ii) the normalizing constant of G-Wishart is not a constant with respect to graph G and therefore comparing two graphs requires explicit evaluation of the intractable normalizing constant whereas π(f), due to the thresholding function, does not have such complication.
Given f(•) and X, the proposed GGMx satisfies functional Markov properties, e.g., the pairwise functional Markov property, which is stated formally in the following lemma.
The proof of Lemma 1 directly follows from the fact that f jk (X) = 0 implies there is a missing edge between nodes j and k given covariates X, which in turn implies that A natural choice of g jk (•) is a linear function g jk x i = β jk T x i although, in general, g jk (•) can be any smooth function.Given the limited sample size of the case study, we consider g jk (•) to be linear for parsimony (see Section 8 for a brief discussion on modeling a nonlinear g jk ) and interpretability (β jk are the rates of changes of ω ijk in x i ).If the focus is on learning the graph structure and strength, i.e., the off-diagonal elements of Ω i , one can further simplify the model by reducing diagonal elements g jj (x i ) to be constant with respect to the covariates.
GGMx is a fairly flexible class of models and has at least five special cases (see Table 1).
(i) If X only contains the intercept, then GGMx reduces to the case of the standard GGM because the graph is a function of a constant and hence is constant.(ii) If X is categorical, then GGMx is a multiple graphical model (also known as group-specific GGM) as the categorical covariate defines the groups.(iii) If X is univariate time points, then GGMx can be used for modeling time-varying GGMs (Zhou et al., 2010) by treating time as a covariate 1 .(iv) If the thresholds t jk 's are fixed to 0, then GGMx is a covariate-dependent GGM in which the strength of the graph varies continuously with the covariates but the structure is constant because a non-zero linear function is non-zero almost everywhere.(v) If X is a subset of Y, then GGMx can be interpreted as a context-specific GGM (Nyman et al., 2017) where the graph structure varies with (discretized) X.
Priors.We assign priors to β jk and t jk , which in turn define π(f).We assume an independent multivariate Gaussian prior β jk ∼ π(β jk ) = N(β jk | 0, τ jk I q ).The thresholding parameter t jk can be interpreted as the minimum size of off-diagonal elements of Ω i .Since its value is usually unknown in practice, we assign a truncated normal prior t jk ∼ π t jk = N μ t , σ t 2 I t jk > 0 to reflect the uncertainty.As we will show in the next section, the priors of β jk and t jk induce a mixture of non-local priors on Ω i .
To complete the prior formulation, for the hyperparamter τ jk , we assign a hyperprior where IG(a, b) denotes an inverse-gamma density with shape a and scale b, and C τ is the normalizing constant in (2), Including C τ in the prior of π(τ) serves to cancel out C τ −1 in (2) so that the full conditional of τ jk is inverse-gamma.Similar cancellation trick has been used and thoroughly investigated in Bayesian graphical lasso (Wang et al., 2012).
A schematic representation of the proposed GGMx is provided in Figure 2.

Theoretical Properties
We establish a general result of the connection between the proposed prior of precision matrices induced by ( 2) and (3) and non-local alternative priors in GGM.A non-local prior assigns a vanishing density (under the alternative hypothesis) to the neighborhood of the null hypothesis.In variable selection contexts, this density vanishes around 0 and therefore shrinks small effect to zero, which is appealing because we are interested in a parsimonious estimation of the graph (i.e., a sparse network).Non-local priors have been shown, both theoretically and empirically, to have superior performance over local priors in various applications including hypothesis testing, high-dimensional sparse regression, and Bayesian networks (Johnson andRossell, 2010, 2012;Altomare et al., 2013;Rossell and Telesca, 2017;Shin et al., 2018;Ni et al., 2019).However, to the best of our knowledge, all existing priors of sparse precision matrices in GGM (G-Wishart, Bayesian graphical lasso, and stochastic search structure learning prior) are local, i.e., π(Ω) does not approach 0 as ω jk → 0 for (j, k) ∈ E. Conceptually, local priors have a seemingly "contradictory" representation of one's prior belief.On the one hand, (j, k) ∈ E suggests ω jk is non-zero.
But on the other hand, local priors fail to assign zero mass at ω jk = 0; in fact, local priors often assign the maximum mass at zero.The practical implication of such "contradiction" is that local priors tend to favor denser models and be more susceptible to false discoveries compared to non-local priors especially for high-dimensional models like GGMx.
Let π θ and π t generically denote the priors for θ jk and t jk , θ jk ∼ π θ θ jk and t jk ∼ π t t jk .Let T = t jk .We now show the connection between non-local priors and the proposed prior of the following general form, and where ω jk = θ jk I θ jk > t jk , for j < k .
Note that the equations above have no reference to covariates.We deliberately do so for clarity and generality; all the following theoretical results apply to the marginal distribution π(Ω i ) in GGMx by letting θ jk = g jk x i = β jk T x i and ω jj = exp{g jj x i }.Conditional on t jk , the prior π θ induces a spike-and-slab mixture distribution, Slightly abusing the notations, let ω = ω 1 , …, ω M = ω 12 , …, ω 1p , ω 23 , …, ω 2p , …, ω p − 1, p be an M-dimensional vector containing upper-triangular elements of Ω with M = p 2 . Let S ⊆ 1, …, M denote the indices of non-zeros elements in Ω (or equivalently in ω), i.e. ω m = 0 if and only if m ∈ S c .Then the conditional prior of Ω given T can be written as a mixture over all possible subsets S, where g(T ) = ∫ π(Ω | T )I Ω ∈ M + dΩ is the normalizing constant and 2 {1,…,M} is the power set of {1, … , M}.Our main theorem shows that under very mild conditions, the marginal prior π(Ω) is a discrete mixture of non-local priors.Before we present the main theorem, we first state a lemma that is useful in proving the theorem.
Proof Consider The first inequality holds because diagonally dominant symmetric matrix is positive definite and the second inequality is true because θ jk ≤ λ implies ω jk ≤ λ by design and T is independent of ω jj and θ jk .If π θ has positive mass around zero and π d is not a point mass at zero, we can pick a sufficiently small (but positive) λ* > 0 such that the lower bound L(λ*) of g(T) is positive.Then it follows that E[1/g(T)] < ∞.
Theorem 1 The marginal prior π(Ω) is given by where π S (Ω) is the prior under the hypothesis H S : ω m ≠ 0, m ∈ S and ω m = 0, m ∈ S c .
Moreover, π S (Ω) is a non-local prior for any S ∈ 2 1, …, M \∅, that is, π S (Ω) 0 as ω m 0 for m ∈ S, provided (i) Pr(t = 0) = 0, (ii) π θ (•) is bounded and has positive mass near 0, and Proof The marginal distribution of Ω is given by We will show that for any m ∈ S and any sequence ω m .
Conditions (i) -(iii) in Theorem 1 are very mild and satisfied by a wide range of π t , π θ , and π d .Condition (i) is trivially satisfied if π t is continuous (e.g., gamma, inverse-gamma, log-normal, and truncated normal distributions).Condition (ii) holds for Cauchy, normal, and most of the scale mixtures of normal distributions such as Laplace, normal-gamma, and t distributions.Condition (iii) only excludes point mass at zero δ 0 (•) from all the possible choices of π d (•).

A simple illustrative example
As a concrete example, π S (Ω) is non-local under the prior distributions specified in Section 3, namely, π θ θ jk = N(0, τ), π d ω jj = log-normal(0, τ), and π t t jk = N μ t , σ t 2 I t jk > 0 .To visualize the proposed non-local prior, we consider a small precision matrix with p = 3 and perform a prior simulation to generate Ω from π(Ω), the procedure of which is a special case of the posterior simulation procedure (ignoring the likelihood) to be described in Section 5. We visualize π S (Ω) for S = {1, … , M}, i.e., a complete graph.The marginal densities of pairs of off-diagonal elements of Ω (normalized to partial correlations) are depicted in the top panel of Figure 3 which show vanishing density as ω jk approaches 0. By contrast, a local prior on Ω (simulated by fixing t jk = 0) has an increasing density as ω jk approaches 0 as shown in the bottom panel of Figure 3.
Remark The connection between non-local priors and random thresholding has been investigated in the regression context (Rossell and Telesca, 2017;Ni et al., 2019).We make a nontrivial extension to precision matrix estimation for undirected GGMs.One major difference between our theory and those in Rossell and Telesca (2017) and Ni et al. ( 2019) is the complexity of the intractable prior normalizing constant g(T) in ( 5).Intractable prior normalizing constant is a common challenge in standard Bayesian GGMs (Dobra et al., 2011;Wang et al., 2012;Wang, 2015), both theoretically and computationally.In order to show the equivalence between non-local priors and random thresholding for GGMs, we make extra assumptions, i.e., π θ (•) has positive mass around zero and π d ( ⋅ ) ≠ δ 0 ( ⋅ ), in order to bound E[1/g(T)].These mild assumptions are not required in previous works.Also note that Rossell and Telesca (2017) truncates probability density whereas we threshold the random variables.Consequently, the resulting marginal prior of Rossell and Telesca ( 2017) is a non-local prior while ours is a discrete mixture of the non-local prior and point mass at 0. Computationally, the issue of intractable normalizing constant is resolved by a carefully designed MCMC algorithm, which will be discussed in the next section.

Posterior Inference
The proposed GGMx is parameterized by three sets of parameters β jk j ≤ k , t jk j < k , and τ jk j ≤ k .The joint posterior distribution of these parameters is given by, where the right-hand side of this equation depends on x i through Ω i = f (x i ) and f(•) is defined by β jk j ≤ k and t jk j < k .The posterior inference of the model parameters is carried out by MCMC.We need to carefully choose a proposal distribution that can propose f ∈ ℳ + efficiently.This is not a trivial task because the probability that we generate f ∈ ℳ + is practically zero if we propose β jk and t jk from naive proposals such as standard random walks.Here, we introduce a proposal that always proposes f ∈ ℳ + .
For illustration, suppose we are currently updating the (j, k)th element of Ω i .Let ω i,−k,k denote the kth column of Ω i without the kth row and let Ω i,−k,−k denote the submatrix of Ω i without the kth row and column.Let The proposal standard deviations η t 2 and η β 2 can be set to achieve desired acceptance rate (say, 20%-40%).
(II) Update the hypervariances τ jk from the inverse-gamma full conditional,

Graph estimation.
A point estimate of G i can be obtained by thresholding the posterior probability of inclusion.Specifically, we select j, k ∈ E i if P r j, k ∈ E i | y i , x i > c where c ∈ [0, 1] is the probability cutoff2 .The posterior probability of inclusion can be approximated by the MCMC samples, where the superscript (r) indexes the posterior samples.

Graph interpolation.
Since the precision matrix Ω i = f (x i ) is modeled as a function of x i , we can interpolate a graph G * = (V, E * ) for an unseen observation at covariates x * .It is achieved through the posterior predictive distribution of f(•), which can be approximated by the MCMC samples, r) x * ≠ 0 .
Graph interpolation requires covariates x * only, since the right-hand side of the equation above does not depend on y * .In practice, this is a desirable property.For example, one can predict the gene network for new patients without sequencing the whole genome; the measurement of covariates (e.g., blood biomarkers) will suffice.

Simulation Setup
We assessed the utility and operating characteristics of GGMx in seven simulation scenarios with different levels of sparsity and types of covariates.The same size of the dataset in application was used: n = 151, p = 33, and q = 2 (q was set to 1 for the last scenario).
Note that even with a moderate dataset, the number of parameters β jk , t jk , τ jk that need to be estimated is , 772, which is substantially larger than the sample size.We focused on graph structure learning in the first five scenarios by assuming constant diagonal elements g jj (•) for simplicity; non-constant case (i.e., simultaneous inverse-partialvariance and graphical regression) will be considered in the last two scenarios.We fixed the probability cutoff c to be 0.5 in all scenarios.
Scenario I.-We generated the simulated data from our model.We randomly set 2% of β jkℓ for j < k to be ±1 with equal probability.We set t jk = 0.5 and all the diagonal elements of Ω i to be 1.The covariate x ij was generated from an uniform distribution x ij ∼ iid U( − 1, 1).The resulting precision matrix Ω i might not be positive definite for all observations i = 1, … , n.
We repeated the process until Ω i > 0, ∀i.Then the observation y i was drawn from normal Using the same procedure, we generated a similar independent dataset with sample size 50 for testing graph interpolation of GGMx.
Scenario II.-The procedure in Scenario I was inefficient to generate a denser network.
In addition, it may not mimic well the data in application.In this scenario, we used one posterior draw from GGMx applied to the multiple myeloma data as simulation truth.The true β jkl 's are shown as heatmaps in Figure 4a where ℓ = 1 corresponds to the intercept and ℓ = 2, 3 correspond to the two covariates.Since the heatmap of β jk1 is denser than those of β jk2 and β jk3 , there were more nearly constant edges than highly varying edges.The true t jk 's are shown in Figure 4b.The covariates x i of the multiple myeloma dataset was used.
And y i was drawn from the model Scenario III.-This scenarios considered a simulation truth from an ordinary GGM, i.e.Ω i = Ω, ∀i.We generated a true Ω as follows.

3.
Since Ω might not be positive definite, we kept adding 0.1I to Ω until Ω became positive definite.The resulting partial correlations were less than 0.4 in magnitude.Then we simulated y i ∼ iid N 0, Ω −1 and x ij ∼ iid U( − 1, 1).GGMx took the independently generated x ij as covariates, which were pure "noises" for constructing the graph of y i .
Scenario IV.-We extended Scenario III to multiple graphs with C = 3 groups.The sample size of each group was n 1 = 50, n 2 = 50, and n 3 = 51.Graph G 1 was generated as an Erdös-Rényi graph with connecting probability 10%, which led to 63 edges.We randomly turned 3 edges on and 3 edges off from G 1 to obtain G 2 and similarly constructed G 3 from G 2 .As a result, each pair of (G 1 , G 2 ) and (G 2 , G 3 ) shares about 90% edges whereas (G 1 , G 3 ) shares about 80% edges.Then given graphs, the precision matrices and observations y i were generated in the same way as Scenario III.To apply GGMx in this setting, we let x ij be a binary indicator such that x ij = 1 if observation i belongs to group j for j = 1, 2 and x ij = 0 for j = 1, 2 if observation i belongs to group 3. Scenario V.-We have considered continuous covariates (Scenarios I-II), a discrete covariate (Scenarios IV), or no relevant covariates (Scenario III).Here, we included a scenario with one continuous covariate and one discrete covariate.We generated the data by following Scenario I with one covariate replaced by a Bernoulli(0.5)variable and the corresponding coefficients β jkℓ 's set to ±0.5 with equal probability.
Scenario VI.-We considered a scenario without assuming ω ijj to be a constant; instead we set g jj (x i ) = 0.1 + 0.2x i1 + 0.2x i2 and ω ijj = exp g jj x i .For off-diagonal elements, we randomly included 2% of the edges and the corresponding β jkl for j < k was set to be 0.7.
The covariate x iℓ was generated from x iℓ ∼ iid 2Beta(2, 1).The resulting precision matrix Ω i might not be positive definite for all observations i = 1, … , n.We repeated the process until Ω i > 0, ∀i.Then the observation y i was drawn from normal Scenario VII.-To illustrate GGMx can be used to recover time-varying GGM, we reduced the number of covariate to q = 1 from Scenario VI.
Bayesian Gaussian graphical models (BGGMs) assume i.i.d.multivariate Gaussian likelihood and the G-Wishart prior on the precision Ω ~ W G (b, D) and a uniform prior on the graph G. G-Wishart prior is conjugate to the multivariate Gaussian likelihood.However, due to intractable prior normalizing constant of G-Wishart prior, non-trivial MCMC algorithm is required for posterior inference.We use an efficient trans-dimensional MCMC algorithm proposed by Mohammadi et al. (2015) based on a continuous-time birth-death process.
Graphical lasso (glasso) is a penalized likelihood approach that maximizes the objective function log | Ω | − tr(SΩ) − λ ∥ Ω ∥ 1 where S is the sample covariance matrix.The first two terms are the Gaussian log-likelihood and the last term is an ℓ 1 penalty, which induces sparsity in Ω.The optimization is solved using a coordinate descent algorithm.
Both BGGM and glasso assume i.i.d.sampling and are designed to infer networks that do not change with covariates.For a more fair comparison, we implemented the kernel graphical lasso (k-glasso) approach outlined in Liu et al. (2010a).K-glasso is a modification of glasso with the sample covariance matrix S replaced by a covariate-dependent covariance matrix via kernel smoothing.Specifically, let where || • || is the Euclidean norm, h > 0 is the bandwidth, and K(•) is a Gaussian kernel.Then a sparse estimate of Ω i is obtained by applying glasso with As pointed out in Section 3, the proposed GGMx is a multiple graphical model when the covariates are categorical.Multiple graphical models assume that observations are divided into C groups.The goal is to jointly estimate group-specific sparse precision matrices where n c is the sample size of group c, S (c) is the sample covariance matrix of group c, and P(•) is a penalty that encourages sparsity and similarity of for GGL.
Finally, MGGM uses local priors on sparse precision matrices (Wang, 2015) and can be thought as the local prior counterpart of the proposed method for the multiple graphs setting; comparisons with this method only pertain to Scenario IV.
For GGMx, we set the hyperparameters, a τ = b τ = 10 −1 , μ t = 1, and σ t = 0.2; these choices will be tested in sensitivity analyses at the end of this section.Both GGMx and BGGM were run for 10,000 iterations with 5,000 burn-in.The regularization parameter of glasso was selected by the stability approach (Liu et al., 2010b) implemented in the R package huge.The tuning parameters λ 1 and λ 2 of FGL and GGL were selected based on the approximated Akaike Information Criterion (AIC) as suggested by Danaher et al. (2014).A 20 × 20 grid evenly spaced between 0.05 and 0.5 for λ 1 , and between 0.001 and 0.01 for λ 2 , was used.Likewise, the tuning parameters λ i and h i of k-glasso were also selected based on AIC on a 20 × 20 grid [0.1, 1] × [0.1, 1] for each observation i = 1, … , n.All results were based on 50 repeat simulations.

Simulation Results
To assess the graph recovery performance, we computed true positive rate (TPR), false discovery rate (FDR), and Matthews correlation coefficient (MCC), where TP, FP, TN, and FN stand for true positives, false positives, true negatives, and false negatives.MCC takes value between −1 and 1 with 1 being perfect graph recovery and 0 being random guess.In addition, we scrutinized the edges with inclusion probability that is considerably affected by the covariates' value.Hence, we introduced another three measures: partial TPR (pTPR), partial FDR (pFDR), and partial MCC (pMCC) which are simply TPR, FDR, and MCC restricted to the edges with true frequency of inclusion across observations between 0.1 and 0.9.We report all the metrics in Figure 5. Overall, GGMx had robust, superior performance (with high true positive and low false discovery rates) across all scenarios.
In Scenario I, GGMx clearly outperformed BGGM and glasso in all six measures.This was expected because the data were generated from the proposed model and all edges were associated with covariates.BGGM and glasso assume i.i.d.sampling and therefore did not perform well.Although k-glasso was much better than BGGM and glasso, it is clear that GGMx performed significantly better than k-glasso in all metrics.In addition, GGMx can interpolate graph structure given new covariates.The results of graph interpolation (not shown) were very similar to those of graph estimation.
In Scenario II, it appeared that BGGM was comparable to GGMx in TPR, FDR and MCC.This is because in the simulation truth, there were much more nearly constant edges than highly varying edges.In many cases including the application, it is interesting to focus on highly varying edges as they are most differential across observations.Not surprisingly, GGMx had favorable performance compared to BGGM and glasso in terms of pTPR, pFDR, and pMCC.K-glasso was not able to pick up the signals in this scenario, which mimicked the real data.
In Scenario III where there was no relationship between graph and covariates, BGGM outperformed GGMx, glasso, and k-glasso.But GGMx still had a reasonably good performance with the lowest FDR and substantially better overall performance than glasso and k-glasso.
In Scenario IV (multiple graphical models), GGL, FGL, and MGGM had higher TPR compared to GGMx, however, at the price of higher FDR.Consequently, GGMx outperformed GGL, FGL, and MGGM in terms of FDR and the overall measure MCC.
In Scenario V, GGL and FGL were applied ignoring the continuous covariate whereas k-glasso was applied ignoring the discrete covariate.GGMx was able to simultaneously incorporate both continuous and discrete covariates in estimating graphs and therefore as expected it had the best performance compared to BGGM, glasso, k-glasso, GGL, and FGL in practically all measures.
In Scenario VI where the diagonal elements of Ω i were not constrained to be constants, the results were consistent with those in Scenarios I-V.BGGM and glasso outperformed k-glasso overall but k-glasso was much better with respect to the selection of the edges that have substantial variability (measured by pTPR, pFDR, and pMCC).The proposed GGMx was clearly the best in both overall measures and partial measures.For example, GGMx had considerably higher MCC as well as pMCC than all the competing methods.In addition, we also evaluated the estimation accuracy of Ω i by computing the mean squared error (MSE).
We again focused on edges with true frequency of inclusion across observations between 0.1 and 0.9.The resulting MSE was 0.10, 1.24, 0.44, and 0.82 for GGMx, BGGM, glasso, and k-glasso, which demonstrated the capability of the proposed GGMx in capturing the heterogeneity in Ω i .
In Scenario VII, the main conclusion stays the same as in Scenario VI although k-glasso had significantly reduced FDR, however, at the price of significantly reduced TPR.GGMx, on the other hand, demonstrated its stable performance across all scenarios and all measures.

Application in Multiple Myeloma
We present an application of GGMx in modeling transcriptomic regulation in multiple myeloma (MM) which is a late-stage malignancy of plasma cells.Recent research has shifted the focus from traditional "one size fits all" therapies to precision medicine strategies because MM is a highly heterogeneous genetic disease at an individual level (Hervé et al., 2011).To find better personalized treatment and more accurate prescriptive recommendations to MM patients, there needs to be a better understanding of the heterogeneity based on genomically defined pathways (Lohr et al., 2014).We use data generated by the Multiple Myeloma Research Consortium, a multi-institutional collaborative research effort collected data (among others) on gene expressions and clinical parameters from MM patients (Chapman et al., 2011).
We focus our analyses on the genes mapped to one of the most important pathways in MM, NF-κB signaling pathway.Activation of the NF-κB pathway has been implicated in MM, but the genomic foundation of such activation is only partially understood (Demchenko et al., 2010;Roy et al., 2018) 6.We use these two prognostic factors as covariates (q = 2).
The goal of this study was to infer subject-level gene expression networks whose structures are modified by the prognostic factors.After removing outliers and samples with missing gene expression or clinical information, we had n = 151 samples and p = 33 genes.We ran two separate MCMCs, each with 50,000 iterations, discarded the first 50% as burn-in 3. The resulting prior means (variances) of the minimum effect size t jk are 1.0 (0.22), 1.3 (0.63), and 1.6 (0.77).
and saved every 50th sample after burn-in.To check MCMC convergence, we calculated the potential scale reduction factor (PSRF, Gelman et al. 1992) for each entry in Ω i , i = 1, … , n.The median PSRF was 1.00 with interquartile range 0.01, which showed no lack of convergence.We then concatenated the two chains and all subsequent inference was based on the combined Monte Carlo samples.The probability cutoff c was chosen to control the posterior expected FDR at 1%.

Population-level inference
The estimated graphs had 30 edges per subject on average with minimum 20 edges (from a stage III patient) and maximum 37 edges (from a stage I patient).We summarized a population-level gene expression network G = (V, E) as the union of all networks across subjects E = ∪ i = 1 n E i .There were |E| = 42 edges in G.To visualize the graph variability, we computed the variance of edge inclusion.Specifically, let e jk = (e 1jk , … , e njk ) be a binary vector such that e ijk = 1 if {j, k} ∈ E i .Then for edge {j, k}, the variance of edge inclusion was defined as the sample variance of e jk .The population-level network was reported in Figure 7, with the edge width proportional to edge inclusion variability.We found 14 out of 42 edges with variance greater than 0.2 (note the maximum variance is 0.25 for Bernoulli random variable).These 14 edges appeared in about 30%-70% of the patients.In line with our simulation studies, traditional GGMs are unlikely to accurately capture these differential edges.

Subject-level inference
Next, we focus on the subject-level inference.We chose 6 representative patients, 2 from each stage, to show their respective networks in Figure 8.The values of their prognostics factors are represented by the dots in Figure 6.We set the edge width proportional to the absolute value of partial correlation ρ ijk = − ω ijk ω ijj ω ikk , and use solid lines to represent positive partial correlations and dashed lines negative partial correlations.
We highlight several interesting biological findings.RELB was found to be a highly connected gene across all patients (Figures 7 and 8).RELB is a core member of NF-κB family.Hence it is not surprising that RELB played an import role in NF-κB pathway.
In fact, many MM patients have abnormal NF-κB target gene expression, associated with genetic aberration of NFKB1 and NFKB2 (Annunziata et al., 2007).This further confirms our finding that RELB was consistently positively associated with NFKB1 and NFKB2.
In addition, NFKBIA is an inhibitor of NF-κB, which is consistent with our findings that NFKBIA was negatively associated with RELB across patients.It is also known that genes in the same family tend to be positively associated with each other.Our study found positive links, for example, BIRC2-BIRC3 and NFKBIA-NFKBIZ.As disease progresses, some paths get blocked and some new connections get acquired.Among others, the link between LTB and TNFRSF13B was found in stage III patients but not in stage I patients whereas the link between NFKBIL2 and MAP3K7IP2 was lost in stage III patients.While some of those links are well documented in the biological literature (Liu et al., 2017), their gain and loss mechanisms need further validation and investigation.
Finally, as new patients come into the clinic, GGMx can be used to quickly predict the individualized gene network only based on the blood test results of Sβ 2 M and serum albumin without the costly and time-consuming whole genome sequencing.For illustration, we picked two sets of covariates that were unobserved in our collected data; they are represented by triangles in Figure 6.The estimated gene expression network of the two hypothetical patients are shown in Figure 9, which was enabled by the unique feature of graph interpolation of the proposed GGMx.

Discussion
In this article, we introduce a general regression framework for (undirected) Gaussian graphical models with covariates (GGMx).This generalization of regular GGM beyond i.i.d.data allows the graph structure and strength to change with covariates and is particularly challenging especially in the undirected graph context due to the positive definiteness constraint of a precision matrix.We have addressed this challenge through a novel prior that is theoretically connected to non-local priors for precision matrices, paired with a carefully designed MCMC algorithm for efficient posterior inference.GGMx includes at least five special cases including standard GGMs, group-specific GGMs, time-varying GGMs, covariate-dependent GGMs, and context-specific GGMs.We demonstrated the utility and robustness of GGMx through extensive simulations and an application in precision oncology.Our GGMx framework is broadly applicable to many other scientific domains of interest.For example, in brain functional magnetic resonance imaging data, GGMx can be used to study how brain connectivity networks change with covariates such as time and stimuli.
We remark that covariance regression (Hoff and Niu, 2012;Fox and Dunson, 2015) is a closely related model.It is, however, fundamentally different from the proposed GGMx in at least two ways.First, covariance regression assumes the covariance matrix rather than the precision matrix to be a function of x i which takes a specific form, Σ i = Ω i −1 = Ψ + Λ (x i ) + Λ (x i ) T for some PDM Ψ and matrix-value function Λ(x i ).Second, covariance regression assumes a dense Σ i whereas GGMx allows Ω i to be sparse and moreover, the sparsity pattern can change with covariates.Note that zeros in Λ(x i ) generally do not translate to zeros in Σ i or Ω i .Therefore, it is not straightforward to extend the covariance regression framework to allow sparsity.
While our work is a useful first step for undirected graphical regression, there are several extensions and refinements possible.We have chosen the smooth covariate-dependent functions, g jk (•), to be linear for simplicity and parsimony.Same choice has been made by similar papers (Cheng et al., 2014).However, in general, it can be replaced by a nonlinear function.For example, letting x denote some basis expansion of x such as splines and wavelets, we can model g jk (x) = β jk T x and the same inference procedure with linear functions applies.We plan to incorporate nonlinearity in our future work.Furthermore, we have worked with a moderate number of variables due to several reasons.First, the number of parameters that need to be estimated in GGMx is on the order of p(p + 1)(q + 2) 2 + p(p − 1) 2 , which can be large even for a moderate number of variables and covariates.Second, from an application perspective, we focus on a specific signalling pathway in multiple myeloma, NF-κB for deeper scientific interpretations.The small sample size (relative to the number of parameters) does not allow for reliable inferences for a much larger number of variables (e.g., the entire transcriptomic profile).Finally, the scalability of the proposed GGMx also limits the number of variables under consideration.The scalability can be potentially improved by adopting more efficient MCMC algorithms such as Metropolis-adjusted Langevin algorithm (Roberts et al., 1996) or Hamiltonian Monte Carlo (Duane et al., 1987).Both algorithms take advantage of gradient information of the target distribution.However, the hard thresholding function in (3) is discontinuous.This difficulty can be potentially overcome by considering a continuous relaxation of the hard thresholding function (Cai et al., 2018).Another potential solution is resorting to variational Bayes algorithms, which approximate the posterior distributions by simpler variational distributions through minimizing the Kullback-Leibler divergence between them.We hope to address the scalability issue in our future work.A schematic representation of GGMx.Simulations.Operating characteristics averaged over 50 repeat simulations under seven scenarios.Graph interpolation is not shown as it is similar to graph estimation.pMCC is 0 when no missing edge is detected.pTPR, pFDR, and pMCC are not available for Scenarios III and IV.Population-level summary of gene expression network.The network is a union of all networks across subjects.The edge width is proportional to edge inclusion variability.Network interpolation for two sets of unseen prognostic factors, represented as triangles in Figure 6.Five special cases of GGMx.

Figure 1 :
Figure 1:Illustration of GGMx.Subject-level sparse precision matrices Ω i and graphs G i of Y vary with covariates X.The edge thickness is proportional to its strength of association ω ijk .Both edge strength and graph structure change with X. GGMx can also be viewed as a generative model: X generate graphs which in turn generate Y.

Figure 4 :
Figure 4: Simulation truths for Scenario II.Heatmaps of (a) true β jkl 's and (b) true t jk .They are one posterior draw from GGMx applied to the multiple myeloma data.

Figure 6 :
Figure 6: Observed prognostic factors are shown as crosses and dots.Dots are chosen as representative cases for network visualization in Figure 8. Triangles will be used to interpolate networks for unseen patients shown in Figure 9.The prognostic covariates space are partitioned into Stages I, II, and III, according to the International Staging System for multiple myeloma.

Figure 8 :
Figure 8:Subject-level networks for six representative patients, represented as dots in Figure6.The edge width is proportional to the absolute value of partial correlation.The sign of partial correlation is represented by line type: + solid line and -dashed line.
Figure 9: i = β jk ω ijk = β jk I β jk > t jk Group-specific GGMx i = c, c ∈ {1, … , C} g jk x i = β jkc ω ijk = β jkc I β jkc > t jkTime-varying GGMx i = x, time x ∈ ℝ g jk (x) = x ⋅ β jk ω ijk = x ⋅ β jk I x ⋅ β jk > t jkCovariate-dependent GGM t jk = 0 i = y i1 ⋅ β jk ω ijk = y i1 ⋅ β jk I y i1 ⋅ β jk > t jk Shaddox et al. 2018of observations can be represented by a categorical variable, GGMx is able to learn group-specific graphs.For comparison, we consider three alternative multiple graphical model approaches, the two penalized approaches proposed in Danaher et al. (2014), fused graphical lasso (FGL) and group graphical lasso (GGL), and the Bayesian multiple Gaussian graphical model (MGGM) proposed byShaddox et al. 2018.Both penalized algorithms maximize the following objective with respect to positive definite matrices . Clinical information includes measurements of two important prognostic factors, serum beta-2 microglobulin (Sβ 2 M) and serum albumin.The International Staging System (Greipp et al. 2005) uses these two prognostic factors to stage MM: stage I, Sβ 2 M < 3.5 mg/L and serum albumin ≥ 3.5 g/dL; stage II, neither stage I nor III; and stage III, Sβ 2 M ≥ 5.5 mg/L.The observed values of Sβ 2 M and serum albumin, and the staging partition are depicted in Figure

Table 2 :
Sensitivity Analysis.Operating characteristics for simulations under six alternative hyperparameter settings.The numbers are calculated on the basis of 50 repetitions; standard deviations are within parentheses.The first row shows the performance of GGMx in Scenario VII with default hyperparameter setting (a τ , b τ ) = (10 −1 , 10 −1 ) and (μ t , σ t ) = (1, 0.2).